Abstract


Understanding the mechanisms of learning is one of the cen- 
tral questions of Cognitive Science. Recently Marcus et al. 
showed that seven-month-old infants can learn to recognize 
regularities in simple language-like stimuli. Marcus proposed 
that these results could not be modeled via existing connec- 
tionist systems, and that such learning requires infants to be 
constructing rules containing algebraic variables. This paper 
proposes a third possibility: that such learning can be ex- 
plained via structural alignment processes operating over 
structured representations. We demonstrate the plausibility of 
this approach by describing a simulation, built out of previously
tested models of symbolic similarity processing, that models the
Marcus data. Unlike existing connectionist simulations,
our model learns within the span of stimuli presented
to the infants and does not require supervision. It can handle 
input with and without noise. Contrary to Marcus’ proposal, 
our model does not require the introduction of variables. It
incrementally abstracts structural regularities, which do not
need to be fully abstract rules for the phenomenon to appear. 
Our model also proposes a processing explanation for why in- 
fants attend longer to the novel stimuli. We describe our 
model and the simulation results and discuss the role of struc- 
tural alignment in the development of abstract patterns and 
rules. 


Introduction 


Understanding the mechanisms of learning is one of the cen- 
tral questions of cognitive science. Recent studies (Gomez & 
Gerken, 1999; Marcus, Vijayan, Rao & Vishton, 1999) have 
shown that infants as young as seven months
can process simple language-like stimuli and build generali- 
zations sufficient to distinguish familiar from unfamiliar 
patterns in novel test stimuli. In Marcus et al.’s study, the
stimuli were simple ‘sentences,’ each consisting of three 
nonsense consonant-vowel ‘words’ (e.g., ‘ba’, ‘go’, ‘ka’). 
All habituation stimuli had a shared grammar, either ABA or 
ABB. In ABA-type stimuli the first and the third word are 
the same: e.g., ‘pa-ti-pa.’ In ABB-type stimuli the second
and the third word are identical: e.g., ‘le-di-di’. The infants 
were habituated on 16 such sentences, with three repetitions 
for each sentence. The infants were then tested on a different
set of sentences that consisted of entirely new words. Half of
the test stimuli followed the same grammar as in the habitua- 
tion phase; the other half followed the nontrained grammar. 
Marcus et al. found that the infants dishabituated signifi- 
cantly more often to sentences in the nontrained pattern 
than to sentences in the trained pattern. 

Based on these findings Marcus et al. proposed that in- 
fants had learned abstract algebraic rules. They noted that 
these results cannot be accounted for solely by statistical 
mechanisms that track transitional probabilities. They fur- 
ther argue that their results challenge connectionist models 
of human learning that use similar information, on two 
grounds: (1) the infants learn in many fewer trials than are 
typically needed by connectionist learning systems; (2) more 
importantly, the infants learn without feedback. In particular, 
Marcus et al. demonstrated that a simple recurrent network 
with the same input stimuli could not model this learning 
task. 

In response, several connectionist models have attempted 
to simulate these findings. Unfortunately, all of them to date 
include extra assumptions that make them a relatively poor 
fit for the Marcus et al. experiment. For example, Elman
(1999; Seidenberg & Elman, 1999) use massive pre-training 
(50,000 trials) to teach the network the individual stimuli. 
More importantly, they turn the infants’ unsupervised learn- 
ing task into a supervised learning task by providing the 
network with external training signals. Other models tailored 
to capture the data of the study seem unlikely to be applica- 
ble to other similar cognitive tasks (Altmann & Dienes, 
1999). Using a localist temporal binding scheme, Shastri 
and Chang (1999) model the infant results without pretrain- 
ing and without supervision, but still require an order of 
magnitude more exposure to the stimuli than the infants re- 
ceived. 

We propose a third alternative. There is evidence that 
structural alignment processes operating over symbolic 
structured representations participate in a number of cognitive
processes, including analogy and similarity (Gentner, 1983),
categorization (Markman & Gentner, 1993), detection of symmetry
and regularity (Ferguson, 1994), and learning and transfer
(Gentner & Medina, 1998). Although these
representations and processes are symbolic, they do not need 
to be rule-like, nor need they involve variables. Instead, we 
view the notion of correspondence in structural alignment as 
an interesting cognitive precursor to the notion of variable 
binding.¹ Correspondences between structured representations
can support the projection of inferences, as the analogy
literature shows, and therefore a symbolic system can draw 
inferences about novel situations even without having con- 
structed rules. Moreover, as discussed below, comparison 
can be used to construct conservative generalizations. 
Across a series of items with common structure such a proc- 
ess of progressive abstraction can eventually lead to abstract 
rule-like knowledge. The attainment of rules, in those cases 
where it occurs, is the result of a gradual process. As we 
will show, symbolic descriptions can be used with structural 
alignment to model learning that is initially conservative, but 
which occurs fast enough to be psychologically realistic.

We first describe our simulation model of the Marcus et al.
task, which uses a simple combination of preexisting simulation
modules, namely SME, MAGI, and SEQL. All of these
modules have been independently tested against psychological
data and independently motivated in prior modeling
work. With the exception of domain-specific encoding pro- 
cedures, no new processing components were created for 
this task. We then describe the results of our simulation of 
the Marcus et al. data, showing that our simulation can learn
the concepts within the number of trials that the infants had, 
without supervision and without prelearning. We also show 
that the simulation can exhibit the same results with noisy 
input data. Finally, we discuss some of the implications of 
the symbolic similarity approach for models of cognitive 
processing.


Modeling infant learning via structural 
alignment 


A psychological model of the infants’ learning must in- 
clude the kind of input, the way the infants are assumed to 
encode the individual sentences, and the processes by which 
they generalize across the sentences. The architecture of our 
simulation is shown in Figure 1. We first describe our as- 
sumptions concerning the infants’ processing capacities. 
Then we describe each component in turn. 

Processing Assumptions: We assume that infants can 
represent the temporal order within the sentences (Saffran, 
Aslin & Newport, 1996). We further assume that the infants 
notice and encode identities within the sentences: for example,
the fact that the last two elements match in an ABB sentence.
This assumption is consistent with evidence that human
infants, as well as nonhuman primates (Oden et al., in press),
can detect identities. We also assume
that infants can detect similarities between sequentially pre- 
sented stimuli, consistent with studies of infant habituation, 
which demonstrate that infants respond to sequential same- 
ness (e.g., Baillargeon, 1994). 


1 That the structure-mapping algorithm neither subsumes, nor is
subsumed by, traditional pattern matching such as unification is
shown in Falkenhainer, Forbus, & Gentner (1988).
Figure 1: Simulation Architecture


Input stimuli: To make our simulation comparable with 
others, we use a representation similar to that of Elman 
(1999), namely, Plunkett & Marchman’s (1993) distinctive 
feature notation. Each word has twelve phonetic features, 
which can be either present or absent. The presence or ab- 
sence of each feature for each word is encoded by symbolic 
assertions. If feature n is present for word w, the assertion 
(Rn w) is included in the stimulus, and if absent, the asser- 
tion (Sn w) is included. Thus the acoustic features of each 
word are encoded as twelve attribute statements. 
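
The feature encoding described above can be sketched as follows. This is a minimal illustration only: the function name `encode_features`, the tuple layout, and the example feature vector are assumptions made here, not part of the original simulation, and the vector is not taken from Plunkett & Marchman (1993).

```python
from typing import List, Tuple

# An attribute assertion is a (predicate, word) pair, e.g. ("R3", "de1").
Assertion = Tuple[str, str]


def encode_features(word: str, features: List[bool]) -> List[Assertion]:
    """Return one attribute assertion per phonetic feature of `word`.

    Feature n present -> (Rn word); feature n absent -> (Sn word).
    """
    if len(features) != 12:
        raise ValueError("expected twelve phonetic features")
    return [
        (f"R{n}" if present else f"S{n}", word)
        for n, present in enumerate(features, start=1)
    ]


# Hypothetical feature vector, used purely for illustration.
example = encode_features("de1", [True, False] * 6)
```

Each word token thus contributes exactly twelve assertions, whatever its feature values.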

We modeled the Marcus et al. experiment both without
noise (Experiment 1) and with noise (Experiment 2). Marcus
et al. used a speech synthesizer to control the pronunciation
of the stimuli, but while this reduces variability, it cannot
eliminate the possibility that the infant might encode something
incorrectly.


Temporal encoding: We assume that the infant encodes 
the temporal sequence of the words in a sentence in two 
ways. First, each incoming word has an attribute associated 
with it, corresponding to the order in which it appears (i.e., 
FIRST, SECOND, or THIRD). We further assume that the 
infant encodes temporal relationships between the words in 
a sentence; to code this, an AFTER relation is added between
pairs of words in the same sentence indicating their
relative temporal ordering. The particular labels used in this 
encoding step are irrelevant — there are no rules in the sys- 
tem that operate on these specific predicates — the point is 
simply that infants are encoding the temporal order of words 
within sentences. 
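
A rough sketch of this temporal encoding is given below. The text does not say whether AFTER is asserted for all ordered pairs or only for adjacent words, nor which argument comes first; the sketch assumes all pairs and the form (AFTER later earlier). The function name `encode_order` is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def encode_order(words: List[str]) -> List[Tuple[str, ...]]:
    """Encode position attributes and pairwise AFTER relations.

    `words` holds unique word tokens in presentation order,
    e.g. ["de1", "di1", "de2"] for the sentence de-di-de.
    """
    position_labels = ["FIRST", "SECOND", "THIRD"]
    facts: List[Tuple[str, ...]] = [
        (label, word) for label, word in zip(position_labels, words)
    ]
    # Assumption: one AFTER fact per pair, written (AFTER later earlier).
    facts += [
        ("AFTER", later, earlier)
        for earlier, later in combinations(words, 2)
    ]
    return facts


order_facts = encode_order(["de1", "di1", "de2"])
```

As the text notes, the labels themselves carry no special meaning to the matcher; any consistent set of predicates would serve.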

Regularity Encoding: We assume that the infants notice 
and encode identities within the sentences: for example, the 
fact that the last two elements match in an ABB sentence. 
Thus the simulation must incorporate a process that detects 
when words are the same. We use the MAGI model of sym- 
metry and regularity detection (Ferguson, 1994) to auto- 
matically compute these relationships. MAGI treats symme- 
try as a kind of self-similarity, using a modified version of 
structure-mapping’s constraints to guide the self-alignment
process. MAGI has been successfully used with inputs ranging
from stories to mathematical equations to visual stimuli,
and it has done well at modeling certain aspects of visual
symmetry, including making new predictions (Ferguson et al.,
1996). Here MAGI is used on the collection of words in a
sentence. For any pair of words w1 and w2 that MAGI finds
sufficiently similar, this module asserts (SIM w1 w2), and a
DIFF statement for every other pair of words in the sentence.
(If MAGI does not find any pairs similar, DIFF
statements are asserted for every pair of words.) This module
also asserts (GROUP w1 w2) for pairs of similar words,
to mark that they form a substructure in the stimulus, and 
adds DIFF statements between groups and words not in the 
group. This use of MAGI is an example of what Ferguson 
(1994, in preparation) calls analogical encoding. 
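
The output of this encoding step can be illustrated with a small sketch. MAGI itself is not reproduced here: the sketch replaces MAGI's self-alignment with a plain equality test on feature vectors, which is an assumption made only to show the shape of the SIM, DIFF, and GROUP assertions. The function name `encode_regularity` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def encode_regularity(features: Dict[str, Tuple[bool, ...]]) -> List[tuple]:
    """Assert SIM/GROUP for matching word pairs and DIFF otherwise.

    `features` maps each word token (in sentence order) to its feature
    vector. Equality of vectors stands in for MAGI's similarity judgment.
    """
    words = list(features)
    facts: List[tuple] = []
    for w1, w2 in combinations(words, 2):
        if features[w1] == features[w2]:
            facts.append(("SIM", w1, w2))
            facts.append(("GROUP", w1, w2))
            # Words outside the group differ from the group as a whole.
            for other in words:
                if other not in (w1, w2):
                    facts.append(("DIFF", other, ("GROUP", w1, w2)))
        else:
            facts.append(("DIFF", w1, w2))
    return facts


# An ABA sentence: the first and third tokens share a feature vector.
de, di = (True,) * 12, (False,) * 12
regularity_facts = encode_regularity({"de1": de, "di1": di, "de2": de})
```

If no pair matches, the loop falls through to DIFF for every pair, as the parenthetical remark in the text requires.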


SEQL 

Once each sentence is encoded, we assume infants can de- 
tect the similarities between sequential pairs of sentences. 
The detection of structurally parallel patterns across a se- 
quence of examples is modeled by SEQL (Skorstad, Gentner 
& Medin, 1988; Kuehne, Forbus, Gentner & Quinn, 2000), a 
model of the process of category learning from examples. 
SEQL constructs category descriptions via incremental ab- 
straction. That is, the representation of a category is a struc- 
tured description that has been generated by successive 
comparison with incoming exemplars. If the new exemplar
and the category are sufficiently similar, the category description
is modified to be their intersection, i.e., the commonalities
computed via structural alignment by a generalization
algorithm. If the new exemplar is not sufficiently
similar, it is stored separately and may later be used as the 
seed of a new category. 

The structural alignment process is implemented via
SME (Falkenhainer et al., 1988; Forbus et al., 1994), a cognitive
simulation of analogical matching. Here the base description
is a category description, and the target description
is the new exemplar. The structural alignments that SME
computes are used in three ways by SEQL. First, the numerical
structural evaluation score it computes² is used as a
similarity metric, a numerical measure for deciding whether
or not two descriptions are sufficiently similar. Second, the
candidate inferences it computes serve as a model for category-based
induction (cf. Blok & Gentner, 2000; Forbus,
Gentner, Everett, & Wu, 1997). Third, the correspondences
in the best mapping SME produces serve as the basis for
SEQL’s generalization algorithm.

SEQL maintains a set of generalizations and a set of sin- 
gular exemplars. When a new exemplar comes in, it is com- 
pared against existing generalizations to see if it can be as- 
similated into one of them. Otherwise, it is compared with 
the stored exemplars to see if a new generalization can be 
formed. If it is insufficiently similar to both the generaliza- 
tions and the stored exemplars, it is stored as an exemplar 
itself. 

SEQL begins with no generalizations; it simply stores its 
first exemplar. If the next exemplar is sufficiently close to 
the first, their overlap is stored as the first generalization. A 


2 Although SME can compute multiple mappings, we use the 
structural evaluation score of the best mapping, normalized by the 
size of the base description. 


generalization consists of the overlap between the two input 
descriptions: that is, the shared structure found by align- 
ment. Thus generalizations are structured descriptions of the 
same type as the input descriptions, although containing 
fewer specific features. If a new exemplar is sufficiently 
similar to a generalization (as determined by comparing the
structural evaluation score to a set threshold), then (a) the 
generalization is updated by retaining only the overlapping 
description that forms the alignment between the generaliza- 
tion and the exemplar; and (b) candidate inferences are pro- 
jected from the generalization to the exemplar. Non- 
overlapping aspects of a description (e.g., phonetic features 
or relations that aren’t shared) are thus “worn away” with 
each new assimilated description. (The threshold that de- 
termines when descriptions are sufficiently similar to be 
assimilated helps prevent descriptions from diminishing into 
vacuity.) 
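
The control flow of this incremental abstraction can be sketched as follows. This is not SEQL or SME: structural alignment is replaced by plain set intersection over facts whose entities are already named by position (w1, w2, w3), and the similarity score is simply the shared fraction of the base description. Both simplifications, the class name `SeqlSketch`, and the choice to discard a stored exemplar once it seeds a generalization are assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

Fact = Tuple[str, ...]
Description = FrozenSet[Fact]


def similarity(base: Description, target: Description) -> float:
    """Shared facts divided by the size of the base description."""
    if not base:
        return 0.0
    return len(base & target) / len(base)


class SeqlSketch:
    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self.generalizations: List[Description] = []
        self.exemplars: List[Description] = []

    def add(self, exemplar: Description) -> None:
        # 1. Try to assimilate into an existing generalization.
        for i, gen in enumerate(self.generalizations):
            if similarity(gen, exemplar) >= self.threshold:
                # Non-overlapping facts are "worn away".
                self.generalizations[i] = gen & exemplar
                return
        # 2. Try to seed a new generalization from a stored exemplar.
        for i, stored in enumerate(self.exemplars):
            if similarity(stored, exemplar) >= self.threshold:
                self.generalizations.append(stored & exemplar)
                del self.exemplars[i]
                return
        # 3. Otherwise keep the exemplar on its own.
        self.exemplars.append(exemplar)


# Toy descriptions; a low threshold is used so the toy data can merge.
a = frozenset({("SIM", "w1", "w3"), ("DIFF", "w1", "w2"), ("R1", "w1")})
b = frozenset({("SIM", "w1", "w3"), ("DIFF", "w1", "w2"), ("S1", "w1")})
c = frozenset({("SIM", "w2", "w3"), ("DIFF", "w1", "w2"), ("S1", "w1")})
model = SeqlSketch(threshold=0.6)
for description in (a, b, c):
    model.add(description)
```

In this toy run the first two descriptions merge into a generalization that keeps only their shared relations, while the third falls below the threshold and is stored separately.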

Returning now to the infant studies, we assume that babies
are carrying out an ongoing process of comparing and aligning
the incoming exemplars with an evolving generalization.
We further assume that the relational candidate inferences
from the general pattern to a new exemplar represent expectations
on the part of the infant.³ When these expectations are
violated by an incoming stimulus that does not fit the generalized
pattern (e.g., an ABB test sentence after the ABA
generalization has been formed), we assume the infant requires
extra time to process the inconsistent stimulus.


Simulation Experiments 


In both experiments, we followed the procedure of Marcus
et al. Each stimulus was a simple three-word sentence,
encoded as described earlier. There were two sets of training
stimuli, one following the ABA pattern and one following
the ABB pattern. The training stimuli were:

(ABA) de-di-de, de-je-de, de-li-de, de-we-de, ji-di-ji, ji-je-ji,
ji-li-ji, ji-we-ji, le-di-le, le-je-le, le-li-le, le-we-le, wi-di-wi,
wi-je-wi, wi-li-wi, wi-we-wi

(ABB) de-di-di, de-je-je, de-li-li, de-we-we, ji-di-di, ji-je-je,
ji-li-li, ji-we-we, le-di-di, le-je-je, le-li-li, le-we-we, wi-di-di,
wi-je-je, wi-li-li, wi-we-we

The test stimuli in both experiments were four descriptions representing
two novel ABA-type (ba-po-ba, ko-ga-ko) and two
novel ABB-type sentences (ba-po-po, ko-ga-ga). The
threshold value for SEQL was set to 0.85 in both experiments.


Experiment 1 


This experiment is most comparable to previous simulation
models of the phenomena, in that we assume noise-free
encoding of the stimuli. A simulation run consists of exposing
SEQL to all of the stimuli from a particular training set
(either ABA or ABB) once and then seeing the response
given the four test sentences. To avoid possible biasing due
to sequence effects (see Kuehne et al., 2000), 20 simulation
runs were made for each training set using different random
orders. Identical match scores and relational candidate inferences
were produced for all sequences with a given stimulus
set. In each case, SEQL produced a single generalization
during the learning phase. For the test phase we used encodings
of the corresponding stimuli used with infants, as noted
above. Tables 1a and 1b show the results of this series for
the two generalizations paired against the four test sentences.


3 SME can also produce attribute-level candidate inferences, and
does so on these stimuli. We assume that, since these inferences
concern directly perceivable features, testing them takes very little
time.


Table 1a: ABA training stimuli

Test Stimulus   Match Score   Candidate Inferences
Ba-po-ba                      None
Ko-ga-ko                      None
Ba-po-po        0.486         (DIFF po1 ba1)
                              (DIFF po1 po2)
                              (SIM ba1 po2)
Ko-ga-ga        0.455         (DIFF ga1 ko1)
                              (DIFF ga1 ga2)
                              (SIM ko1 ga1)

Table 1b: ABB training stimuli

Test Stimulus   Match Score   Candidate Inferences
Ba-po-ba        0.328         (SIM po1 ba2)
                              (DIFF ba1 (GROUP po1 ba2))
Ko-ga-ko        0.350         (SIM ga1 ko2)
                              (DIFF ko1 (GROUP ga1 ko2))
Ba-po-po                      None
Ko-ga-ga                      None


The in-grammar (bold) and out-of-grammar (plain text) 
matches show clear differences in their match scores. In-grammar
matches are above 0.64 and do not generate relational
candidate inferences. Out-of-grammar matches have
match scores below 0.5, and lead to relational candidate 
inferences. Thus out-of-grammar test sentences lead to 
longer looking behavior, as predicted. 
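
The link between relational candidate inferences and violated expectations can be made concrete with a small sketch. It is a simplification, not SME: entities are assumed to be pre-aligned by position (w1, w2, w3), so a candidate inference is just a relational fact in the generalization that the test sentence lacks. The predicate set, the function name, and the toy descriptions are illustrative assumptions.

```python
from typing import FrozenSet, Set, Tuple

Fact = Tuple[str, ...]
RELATIONAL = {"SIM", "DIFF", "GROUP", "AFTER"}


def relational_candidate_inferences(
    generalization: FrozenSet[Fact], exemplar: FrozenSet[Fact]
) -> Set[Fact]:
    """Relational facts expected from the generalization but absent
    from the exemplar."""
    return {
        fact
        for fact in generalization
        if fact[0] in RELATIONAL and fact not in exemplar
    }


# Toy ABA generalization with its attributes already worn away.
aba_generalization = frozenset(
    {("SIM", "w1", "w3"), ("DIFF", "w2", "w1"), ("DIFF", "w2", "w3")}
)
aba_test = aba_generalization | {("R1", "w1")}
abb_test = frozenset(
    {("SIM", "w2", "w3"), ("DIFF", "w1", "w2"), ("R1", "w1")}
)

in_grammar = relational_candidate_inferences(aba_generalization, aba_test)
out_of_grammar = relational_candidate_inferences(aba_generalization, abb_test)
```

Under these assumptions the in-grammar sentence yields no relational candidate inferences, while the out-of-grammar sentence yields one SIM and two DIFF expectations, the same shape as the entries reported for the out-of-grammar rows of Table 1a.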


Experiment 2 


As noted earlier, we believe that noise-free stimulus en- 
codings are unrealistic. Consequently, we used the same 
procedure as Experiment 1, but this time introducing noise 
into the representations for the training and test stimuli. For 
each sentence, one of the words was randomly picked, and 
one of its attributes (also chosen at random) was dropped or 
flipped, with the rest of its description being unchanged. 
Such changes can be significant: for example, flipping a 
single phonetic feature turns the word ‘de’ into the word 
‘di’. Again, 20 simulation runs were made for each training 
set using different random orders. Naturally the match 
scores and, to a lesser degree, the generated candidate infer- 
ences, did vary across the individual runs. Tables 2a and 2b 
show the results. The scores were averaged over all 20 runs. 
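
The noise procedure can be sketched as below. The text does not state how often an attribute was dropped rather than flipped, so the sketch assumes an even split; representing a dropped attribute as `None` and the function name `add_noise` are likewise assumptions.

```python
import random
from typing import Dict, List, Optional

Sentence = Dict[str, List[Optional[bool]]]


def add_noise(sentence: Sentence, rng: random.Random) -> Sentence:
    """Return a copy of `sentence` with one attribute of one randomly
    chosen word either dropped (None) or flipped."""
    noisy = {word: list(feats) for word, feats in sentence.items()}
    word = rng.choice(list(noisy))
    index = rng.randrange(len(noisy[word]))
    if rng.random() < 0.5:  # assumed 50/50 split between drop and flip
        noisy[word][index] = None
    else:
        noisy[word][index] = not noisy[word][index]
    return noisy


clean = {"w1": [True] * 12, "w2": [False] * 12, "w3": [True] * 12}
noisy = add_noise(clean, random.Random(0))
changed = sum(
    a != b for word in clean for a, b in zip(clean[word], noisy[word])
)
```

Whichever branch is taken, exactly one attribute in the sentence differs from the clean encoding, and the original description is left unmodified.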

Although the noise affected the details of the computa- 
tions, the overall pattern of results remains the same. The 
in-grammar (bold) match scores are far higher than the out-of-grammar
(plain text) scores; and the out-of-grammar
stimuli produce relational candidate inferences while the in-grammar
stimuli do not.


Table 2a: ABA training stimuli

Stimulus    Match Score    Candidate Inferences (Min, Average, Max)
            0.339
ko-ga-ko    0.352
Comparison with other models


The results of Marcus et al. (1999) have sparked an active 
debate focused on two issues: (1) Can current connectionist 
models (e.g., simple recurrent networks) model these results?
(2) Do infants generate abstract rules that include
variables? 

Regarding the adequacy of simple recurrent networks,
Marcus et al. state “Such networks can simulate knowledge
of grammatical rules only by being trained on
all items to which they apply; consequently, such mechanisms
cannot account for how humans generalize rules to
new items that do not overlap with the items that appeared in
the training.” Elman’s (1999) response describes his use of
a simple recurrent network to model this task. Elman’s
model requires tens of thousands of training trials on the
individual syllables, and treats the problem as a supervised
learning task, unlike the task facing the infants. By contrast,
our simulation handles the learning task unsupervised, and
produces human-like results with only as much exposure to the
stimuli as the infants received. Moreover, our model
also continues to work with noisy data, something not true of 
any other published model of this phenomenon that we know 
of. 

The learning in our model is due to the “wearing away” of 
non-identical phonetic attributes through subsequent com- 
parisons. Although SEQL’s learning proceeds faster than 
connectionist models, it is still slower than systems that gen- 
erate abstractions immediately (e.g., explanation-based 
learning (DeJong & Mooney, 1986)). In SEQL’s progres- 
sive alignment algorithm, the entities in the generalizations 
lose their concrete attributes across multiple comparisons, 
leaving the relational pattern of each grammar as the dominant
force in the generalization only after a reasonable number
of varied examples are seen.⁴ There is considerable evidence
for this kind of conservative learning (Forbus &
Gentner, 1986; Medin & Ross, 1989). 

Turning to the second issue, whether infants have variables
and generate abstract rules, Marcus et al. (1999) claim
“[I]nfants extract abstract algebra-like rules that represent
relationships between placeholders (variables), such as ‘the
first item X is the same as the third item Y,’ or more generally
that ‘item I is the same as item J.’” But our simulation
does not introduce variables, in the sense commonly used in 
mathematics or logic. The generalizations constructed by 
SEQL do indeed include relational patterns that survive re- 
peated comparisons because they are shared across the in- 
grammar exemplars. Furthermore, the entities (words) in the 
generalizations have many fewer features than the original 
words, as a result of the wearing away of features in succes- 
sive comparisons. One could consider these patterns as a 
form of psychological rule, as proposed by Gentner and 
Medina (1998), with the proviso that the elements in the rule 
are not fully abstract variables, although they might asymp- 
totically approach pure variables. 


Discussion 


This paper proposes a third kind of explanation for the infant
learning phenomena of Marcus et al. (1999): incremental
abstraction of symbolic descriptions via structural align- 
ment. We believe our explanation is currently the best one 
for three reasons. First, it models the infant data with fewer 
extra concessions than previously published models (i.e., no 
pre-training, no supervision, and noisy data). Second, the 
processes we postulate are cognitively general; they apply to 
a large set of phenomena. Third, the abstraction processes 
we propose are consistent with research demonstrating that 
human learning is initially conservative (Brooks, 1987; Forbus
& Gentner, 1986; Medin & Ross, 1989). Interestingly,
there is ongoing research in developing symbolic connectionist
models consistent with these processes (e.g., Holyoak
& Hummel, 1997). 

Many issues remain to be explored. For example, although
our system does not introduce variables in its generalization
process, there is a sense in which the entities in the
generalization are on their way to becoming variables. Gentner
and Medina (1998) have proposed that the process of
progressive alignment can lead to rules. They further suggested
that the application of rules to instances can be accomplished
using the same general processes of structural
alignment and projection that are used in analogy. The difference
is that the base domain is an abstraction, in which the entities
are ‘dummies’ with no features to either help or impede the
match with the specific entities in the exemplar. Another
issue concerns the incorporation of statistical notions in 
SEQL. Although SEQL is to a certain degree noise-resistant,
we suspect that to model large-scale learning, it will need to
keep track of more statistical information than it does currently,
so that properties wear away more slowly.


4 SEQL learns with only one exposure to the 16 learning sentences,
whereas Marcus’s infants received three exposures for each
sentence. It is possible that the infants would have learned with
only one pass; however it is also possible that the infants were less
consistent in detecting the similarities than our simulation with its
current parameters.

We note that it is common to conflate symbolic process- 
ing with rule-based behavior, and parallel processing with 
connectionist models. The model described here is sym- 
bolic, but it need not involve variables or rules. Further, it 
involves extensive parallel processing (most of SME and 
MAGI’s computations are parallel). Given the complexity 
of the phenomena, such conflations seem unwise. 

The debates stirred by the Marcus et al. results bear on a 
critical issue in human learning and development: namely, 
what knowledge or mechanisms must be assumed to account
for the rapid and powerful achievements demonstrated by
infants in both cognition and language. Our results suggest 
that the general learning mechanism of structure-mapping 
theory may go a long way in accounting for these 
accomplishments. 